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Abstract 

The paper introduces the butterfly factorization as a data-sparse approximation for the ma¬ 
trices that satisfy a complementary low-rank property. The factorization can be constructed 
efficiently if either fast algorithms for applying the matrix and its adjoint are available or the 
entries of the matrix can be sampled individually. For an N x N matrix, the resulting factoriza¬ 
tion is a product of 0(log N) sparse matrices, each with O(N) non-zero entries. Hence, it can 
be applied rapidly in 0{N log N) operations. Numerical results are provided to demonstrate 
the effectiveness of the butterfly factorization and its construction algorithms. 

Keywords. Data-sparse matrix, butterfly algorithm, randomized algorithm, matrix factoriza¬ 
tion, operator compression, Fourier integral operators, special functions. 
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1 Introduction 

One of the key problems in scientific computing is the rapid evaluation of dense matrix-vector 
multiplication. Given a matrix K G C NxN and a vector g G C N , the direct computation of the 
vector u = Kg G C N takes 0(N 2 ) operations since each entry of K contributes to the result. 
This type of dense multiplication problem appears widely in special function transforms, integral 
transforms and equations, data fitting and smoothing, etc. Since the number of unknowns N 
is typically quite large in these applications, a lot of work has been devoted to performing this 
computation more efficiently without sacrificing the accuracy. Such a reduction in computational 
complexity depends highly on the algebraic and numerical properties of the matrix K. For certain 
types of matrices K, such as the Fourier matrix, numerically low-rank matrices, hierarchically 
semi-separable (HSS) matrices [20], and hierarchical matrices [U'Lj, there exist fast algorithms for 
computing Kg accurately in 0(N log N) or even 0(N ) operations. 

1.1 Complementary low-rank matrices and butterfly algorithm 

Recent work in this area has identified yet another class of matrices for which fast 0(N log N) 
application algorithms are available. These matrices satisfy a special kind of complementary low- 
rank property. For such a matrix, the rows are typically indexed by a set of points, say X, and 
the columns by another set of points, say R. Both X and R are often point sets in for some 
dimension d. Associated with X and R are two trees Tx and Tq, respectively and both trees are 
assumed to have the same depth L = O(logA), with the top level being level 0 and the bottom 
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Figure 1: Trees of the row and column indices. Left: Tx for the row indices X. Right: Tq for the 
column indices 12. The interaction between A E T\- and B E Tq starts at the root of Tx and the 
leaves of Tq. 

one being level L. Such a matrix K of size N x N is said to satisfy the complementary low- 
rank property if for any level l, any node A in Tx at level £, and any node B in Tq at level 
L — £, the submatrix Ka.Bi obtained by restricting K to the rows indexed by the points in A 
and the columns indexed by the points in B, is numerically low-rank, i.e., for a given precision e 
there exists a low-rank approximation of Ka,b with the 2-norm error bounded by e and the rank 
bounded polynomially in log IV and log(l/e). In many applications, one can even show that the 
rank is only bounded polynomially in log(l/e) and is independent of N . While it is straightforward 
to generalize the concept of the complementary low-rank property to a matrix with different row 
and column dimensions, the following discussion is restricted to the square matrices for simplicity. 

A simple yet important example is the Fourier matrix K of size N x N, where 

x = n = {o,...,N -l}, 

I< = (exp(2mjk/N)) 0 ^ j k<N . 

Here the trees T\ and Tq are generated by bisecting the sets X and 12 recursively. Both trees have 
the same depth L = log 2 N. For each pair of nodes A E T\ and B E Tq with A at level i and B 
at level L — £, the numerical rank of the submatrix Ka,b for a fixed precision e is bounded by a 
number that is independent of N and scales linearly with respect to log(l/e) [14| . 





Column Index 


Column Index 


Figure 2: Hierarchical decomposition of the row and column indices of a 16 x 16 matrix. The 
trees Tx and Tq have roots containing 16 column and row indices and leaves containing a single 
column and row index. The rectangles above indicate the submatrices satisfying the complementary 
low-rank property. 

For complementary low-rank matrices, the matrix-vector multiplication can be carried out ef¬ 
ficiently via the butterfly algorithm, which was initially proposed in m and later extended in 
mi- For a general matrix K of this type, the butterfly algorithm consists of two stages: the off-line 
stage and the on-line stage. In the off-line stage, it conducts simultaneously a top down traversal 
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of Tx and a bottom up traversal of T'o (see Figure [T] for an interpretation of data flows) to recur¬ 
sively compress all complementary low-rank submatrices (see Figure [ 2 ] for an example of necessary 
submatrices). This typically takes 0(N 2 ) operations [141116] for a general complementary low-rank 
matrix K. In the on-line stage, the butterfly algorithm then evaluates u = Kg for a given input 
vector g E C N in 0(N log N) operations. While the on-line application cost is essentially linear, the 
0(N 2 ) off-line precomputation cost appears to be a major bottleneck for many calculations. For 
certain complementary low-rank matrices, such as the ones obtained from the Fourier integral op¬ 
erators (FIOs) [HIH1E5J, th e sparse Fourier transforms [23] . and the numerical solutions of acoustic 
wave equations [2], the off-line precomputation cost can be reduced to nearly linear or even totally 
eliminated. However, in all these cases, the reduction heavily relies on strong assumptions on the 
analytic properties of the kernel function of K. When such detailed information is not available, 
we are then forced to fall back on the 0(N 2 ) off-line precomputation algorithm. 

1.2 Motivations and significance 

A natural question is whether it is still possible to reduce the cost of the precomputation stage if 
the analytic properties of the kernel are not accessible. The following two cases are quite common 
in applications: 

1. Only black-box routines for computing Kg and K*g in 0(N log N) operations are given. 

2. Only a black-box routine for evaluating any entry of the matrix K in 0(1) operations is given. 

To answer this question, this paper proposes the butterfly factorization, which represents K 
as a product of L + 3 sparse matrices: 

K « U l G l ~ 1 • • • G h M h (H h )* ■ ■ • ( H (1) 

where the depth L = O(log IV) of Tx and To is assumed to be even, h = L/2 is a middle level 
index, and all factors are sparse matrices with O(N) nonzero entries. 

The construction of the butterfly factorization proceeds as follows in two stages. The first stage 
is to construction a preliminary middle level factorization that is associated with the middle level 
of T x and To 

k « u h M h (v h )* , ( 2 ) 

where U h and V h are block diagonal matrices and M h is a weighted permutation matrix. In the 
first case, this is achieved by applying K to a set of 0{N 1 / 2 ) structured random vectors and then 
applying the randomized singular value decomposition (SVD) to the result. This typically costs 
0(N 3 / 2 log N) operations. In the second case, ([ 2 ]) is built via the randomized sampling method 
proposed in 13 ED for computing approximate SVDs. This randomized sampling needs to make the 
assumption that the columns and rows of middle level blocks of K to be incoherent with respect 
to the delta functions and it typcially takes only 0(N 3 / 2 ) operations in practice. 

Once the middle level factorization ([ 2 ]) is available, the second stage is a sequence of truncated 
SVDs that further factorize each of U h and V h into a sequence of sparse matrices, resulting in 
the final factorization 0- The operation count of this stage is 0(N 3 / 2 ) and the total memory 
complexity for constructing butterfly factorization is 0(N 3 / 2 ). 

When the butterfly factorization 0 is constructed, the cost of applying K to a given vector 
g E C N is 0(N log N) because ([!]) is a sequence of O(logA^) sparse matrices, each with 0(N) non¬ 
zero entries. Although we shall limit our discussion to one-dimensional problems in this paper, the 
proposed butterfly factorization, along with its construction algorithm, can be easily generalized 
to higher dimensions. 
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This work is motivated by problems that require repeated applications of a butterfly algorithm. 
In several applications, such as inverse scattering PH [22] and fast spherical harmonic transform 
(SHT) |T8] , the butterfly algorithm is called repeatedly either in an iterative process of minimizing 
some regularized objective function or to a large set of different input vectors. Therefore, it be¬ 
comes important to reduce the constant prefactor of the butterfly algorithm to save actual runtime. 
For example in [T], Chebyshev interpolation is applied to recover low-rank structures of submatri¬ 
ces with a sufficiently large number of interpolation points. The recovered rank is far from the 
optimum. Hence, the prefactor of the corresponding butterfly algorithm in pQ is large. The but¬ 
terfly factorization can further compress this butterfly algorithm to obtain nearly optimal low-rank 
approximations resulting in a much smaller prefactor, as will be shown in the numerical results. 
Therefore, it is more efficient to construct the butterfly factorization using this butterfly algorithm 
and then apply the butterfly factorization repeatedly. In this sense, the butterfly factorization can 
be viewed as a compression of certain butterfly algorithms. 

Another important application is the computation of a composition of several FIOs. A direct 
method to construct the composition takes 0(N 3 ) operations, while the butterfly factorization 
provides a data-sparse representation of this composition in 0(N 3 ^ 2 log N) operations, once the 
fast algorithm for applying each FIO is available. After the construction, the application of the 
butterfly factorization is independent of the number of FIOs in the composition, which is significant 
when the number of FIOs is large. 

Recently, there has also been a sequence of papers on recovering a structured matrix via applying 
it to (structured) random vectors. For example, the randomized SVD algorithms [6, 9j T9] recover 
a low-rank approximation to an unknown matrix when it is numerically low-rank. The work in 
|12j constructs a sparse representation for an unknown HSS matrix. More recently, [TOj considers 
the more general problem of constructing a sparse representation of an unknown "R-matrix. To 
our best knowledge, the present work is the first to address such matrix recovery problem if the 
unknown matrix satisfies the complementary low-rank property. 

1.3 Content 

The rest of this paper is organized as follows. Section [2] briefly reviews some basic tools that shall 
be used repeatedly in Sections [3] Section [3] describes in detail the butterfly factorization and its 
construction algorithm. In Section[4j numerical examples are provided to demonstrate the efficiency 
of the proposed algorithms. Finally, Section [5] lists several directions for future work. 


2 Preliminaries 


For a matrix Z £ C mxn , we define a rank-r approximate singular value decomposition (SVD) of Z 
as 


Z«t/ 0 S 0 V 0 *, 


where Uq 6 C mxr is unitary, So £ M rxr is diagonal, and Vo £ C nxr is unitary. A straightforward 
method to obtain the optimal rank-r approximation of Z is to compute its truncated SVD, where 
Uo is the matrix with the first r left singular vectors, So is a diagonal matrix with the first r singular 
values in decreasing order, and Vo is the matrix with the first r right singular vectors. 

A typical computation of the truncated SVD of Z takes 0(mn min(m, n)) operations, which 
can be quite expensive when m and n are large. Therefore, a lot of research has been devoted 
to faster algorithms for computing approximate SVDs, especially for matrices with fast decaying 
singular values. In Sections 2.1 and 2.2, we will introduce two randomized algorithms for computing 
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approximate SVDs for numerically low-rank matrices Z\ the first one [6] is based on applying the 
matrix to random vectors while the second one mm relies on sampling the matrix entries randomly. 

Once an approximate SVD Z « UqTqVq is computed, it can be written in several equivalent 
ways, each of which is convenient for certain purposes. First, one can write 

Z^USV*, 

where 

U = U 0 T 0 , S = Sq 1 and V* = S 0 V 0 *. (3) 

This construction is analogous to the well-known CUR decomposition m in the sense that the 
left and right factors in both factorization methods inherit similar singular values of the original 
numerical low-rank matrix. Here, the middle matrix S in ([3]) can be carefully constructed to ensure 
numerical stability, since the singular values in So can be computed to nearly full relative precision. 
As we shall see, sometimes it is also convenient to write the approximation as 

z ~ uv* 


where 


U = U 0 and V* = So Vo*, 

(4) 

U = Uq So and V* = Vo". 

(5) 


Here, one of the factors U and V share the singular values of Z. 

2.1 SVD via random matrix-vector multiplication 

One popular approach is the randomized algorithm in |6j that reduces the cubic complexity to 
0(rmn ) complexity. We briefly review this following [0] for constructing a rank-r approximation 
SVD Z « U 0 S 0 V 0 * below. 

Algorithm 2.1. Randomized SVD 

1. Generate two tall skinny random Gaussian matrices R co i E C nx ( r+p ) and R r0 w £ C mx ( T ' +p ), 
where p = 0(1) is an additive oversampling parameter that increases the approximation ac¬ 
curacy. 

2. Apply the pivoted QR factorization to ZR co i and let Q co i be the matrix of the first r columns 
of the Q matrix. Similarly, apply the pivoted QR factorization to Z*R row and let Q r0 w be the 
matrix of the first r columns of the Q matrix. 

3. Generate a tiny middle matrix M = Q* col ZQ row and compute its rank-r truncated SVD: 
M « Um’EmVm- 

4■ Let Uq = QcoiUm, So = S m, and V 0 * = Vl T Q* ow . Then Z « [7oS 0 V 0 *. 

The dominant complexity comes from the application of Z to 0{r ) random vectors. If fast 
algorithms for applying Z are available, the quadratic complexity can be further reduced. 

Once the approximate SVD of Z is ready, the equivalent forms in ([3]), Q, and ([5]) can be 
constructed easily. Under the condition that the singular values of Z decay sufficiently rapidly, the 
approximation error of the resulting rank-r is nearly optimal with an overwhelming probability. 
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Typically, the additive over-sampling parameter p = 5 is sufficient to obtain an accurate rank-r 
approximation of Z. 

For most applications, the goal is to construct a low-rank approximation up to a fixed relative 
precision e, rather than a fixed rank r. The above procedure can then be embedded into an iterative 
process that starts with a relatively small r, computes a rank-r approximation, estimates the error 
probabilistically, and repeats the steps with doubled rank 2r if the error is above the threshold e 

m- 

2.2 SVD via random sampling 

The above algorithm relies only on the product of the matrix Z E C mxn Q r its transpose with given 
random vectors. If one is allowed to access the individual entries of Z, the following randomized 
sampling method for low-rank approximations introduced in mm can be more efficient. This 
method only visits 0(r ) columns and rows of Z and hence only requires 0(r(m. + n)) operations 
and memory. 

Here, we adopt the standard notation for a submatrix: given a row index set I and a column 
index set J, Zj j = Z(I, J) is the submatrix with entries from rows in / and columns in J; we also 
use “ : ” to denote the entire columns or rows of the matrix, i.e., Zp- = Z(I ,:) and Z-j = Z(:, J). 
With these handy notations, we briefly introduce the randomized sampling algorithm to construct 
a rank-?’ approximation of Z ~ UqTjqVq . 

Algorithm 2.2. Randomized sampling for low-rank approximation 

1. Let n co i and H row denote the important columns and rows of Z that are used to form the 
column and row bases. Initially Ii co i = 0 and n rcm) = 0. 

2. Randomly sample rq rows and denote their indices by S row . Let I = S row U H r0 w Here 
q = 0(1) is a multiplicative oversampling parameter. Perform a pivoted QR decomposition 
ofZr ■ to qet 

Zp.P = QR , 

where P is the resulting permutation matrix and R = (r^) is an 0{r) x n upper triangular 
matrix. Define the important column index set H co i to be the first r columns picked within 
the pivoted QR decomposition. 

3. Randomly sample rq columns and denote their indices by S co i. Let J = S co i U n co /. Perform 
a pivoted LQ decomposition of Z lt j to get 

PZ.,j = LQ , 

where P is the resulting permutation matrix and L = (lij) is an m x O(r) lower triangular 
matrix. Define the important row index set Ii row to be the first r rows picked within the 
pivoted LQ decomposition. 

4- Repeat steps 2 and 3 a few times to ensure n co ; and Ii row sufficiently sample the important 
columns and rows of Z. 

5. Apply the pivoted QR factorization to Z-j\ col and let Q co i be the matrix of the first r columns 
of the Q matrix. Similarly, apply the pivoted QR factorization to ^n roii! . and let Q r0 w be the 
matrix of the first r columns of the Q matrix. 
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6. We seek a middle matrix M such that Z ~ Q C oiMQ*ow- To solve this problem efficiently, 
we approximately reduce it to a least-squares problem of a smaller size. Let S co i and S r0 w be 
the index sets of a few extra randomly sampled columns and rows. Let J = n co ; U S co i and 
I = II r ow U Srow A simple least-squares solution to the problem 

mm ||Zpj - ( Qcoi)i,-M{Q* row )„j\\ 

gives M = {Qcol)\ j> w here (-)t stands for the pseudo-inverse. 

7. Compute an SVD M ~ Um^mVm- Then the low-rank approximation of Z k, UqSqVq is 
given by 

Uo = QcoiU M , So = £m, Vf = V* M Q* row . (6) 

We have not been able to quantify the error and success probability rigorously for this procedure 
at this point. On the other hand, when the columns and rows of K are incoherent with respect 
to “delta functions” (i.e., vectors that have only one significantly larger entry), this procedure 
works well in our numerical experiments. Here, a vector u is said to be incoherent with respect 
to a vector v if n = |w r u|/(||w|| 2 ||u|| 2 ) is small. In the typical implementation, the multiplicative 
oversampling parameter q is equal to 3 and Steps 2 and 3 are iterated no more than three times. 
These parameters are empirically sufficient to achieve accurate low-rank approximations and are 
used through out numerical examples in Section |4j 

As we mentioned above, for most applications the goal is to construct a low-rank approximation 
up to a fixed relative error e, rather than a fixed rank. This process can also be embedded into an 
iterative process to achieve the desired accuracy. 

3 Butterfly factorization 

This section presents the butterfly factorization algorithm for a matrix K e C NxN . For simplicity 
let X = Q = {1,..., IV}. The trees Tx and Tq are complete binary trees with L = log 2 N — 0(1) 
levels. We assume that L is an even integer and the number of points in each leaf node of Tx and 
Xh is bounded by a uniform constant. 

At each level £, £ = 0,..., L, we denote the 7th node at level £ in Tx as A\ for i = 0,1,..., 2 s - — 1 
and the jth node at level L — £ in Tq as for j = 0,1,..., — 1. These nodes naturally 

partition K into 0(N) submatrices K , e b l-i. For simplicity, we write K- - := K b l-i, where 

* ’ j ® * 3 

the superscript is used to indicate the level (in Tx). The butterfly factorization utilizes rank-r 
approximations of all submatrices Kfj with r = 0(1). 

The butterfly factorization of K is built in two stages. In the first stage, we compute a rank-r 
approximations of each submatrix KC at the level £ = h = L/2 and then organize them into an 
initial factorization: 

K « U h M h (V h )*, 

where U h and V h are block diagonal matrices and M h is a weighted permutation matrix. This is 
referred as the middle level factorization and is described in detail in Section [3. II 

In the second stage, we recursively factorize U £ ~ U (+1 G e and {V 1 )* ~ (H e )* (V e+1 )* for 
£ = h, h + 1,..., L — 1, since U e and (V^)* inherit the complementary low-rank property from 
I\ , i.e., the low-rank property of U ( comes from the low-rank property of Kf ■ and the low-rank 
property of V e results from the one of After this recursive factorization, one reaches at the 
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the butterfly factorization of K 


K « U l G l - 1 ■ • • G h M h (H h )* • • • (7) 

where all factors are sparse matrices with O(N) nonzero entries. We refer to this stage as the 
recursive factorization and it is discussed in detail in Section [3.21 


3.1 Middle level factorization 

The first step of the middle level factorization is to compute a rank -r approximation to every KG . 
Recall that we consider one of the following two cases. 

1. Only black-box routines for computing Kg and K*g in 0(N log N) operations are given. 

2. Only a black-box routine for evaluating any entry of the matrix K in 0(1) operations is given. 


The actual computation of this step proceeds differently depending on which case is under consid¬ 
eration. Through the discussion, m = 2 h = 0(N 1 G) is the number of nodes in the middle level 
h = L/2 and we assume without loss of generality that N/m is an integer. 


• In the first case, the rank-r approximation of each KG is constructed with the SVD algorithm 
via random matrix-vector multiplication in Section 2.11 This requires us to apply KG and 
its adjoint to random Gaussian matrices of size (N/m) x (r + p), where r is the desired rank 
and p is an oversampling parameter. In order to take advantage of the fast algorithm for 
multiplying K , we construct a matrix C of size N x m(r + p). C is partitioned into an mxm 
blocks with each block C t j for i. j = 0,1,..., m — 1 of size (N/m) x (r + p). In additional, C 
is block-diagonal and its diagonal blocks are random Gaussian matrices. This is equivalent 
to applying each KG to the same random Gaussian matrix Cj 3 for all i. We then use the fast 
algorithm to apply K to each column of C and store the results. Similarly, we form another 
random block diagonal matrix R similar to C and use the fast algorithm of applying K* to 
R. This is equivalent to applying each (IlG)* to an (N/m) x (r+p) Gaussian random matrix 
Ra for all j = 0,1,..., m — 1. With KGCjj and (KG)*Rh ready, we can compute the rank -r 


approximate SVD of I\G following the procedure described in Section 


2.1 


In the second case, it is assumed that an arbitrary entry of K can be calculated in 0(1) 
operations. We simply apply the SVD algorithm via random sampling in Section 2.2 to each 
KG to construct a rank-r approximate SVD. 


In either case, once the approximate SVD of KG is ready, it is transformed in the form 


K 


h3 


rjh nh 




following Q. We would like to emphasize that the columns of IfG and Vj\ are scaled with the 
singular values of the approximate SVD so that they keep track of the importance of these columns 
in approximating KG . 





After calculating the approximate rank -r factorization of each K ^-, we assemble these factors 


into three block matrices U h , M h and V h as follows: 


K 


( KAA 




A-A-M- 1,0 T \ 
Utm-A-A-A 


\K-lA-lA V O h m-lY K-A-lAm-lY 


K 


uy 


\ ( Ko 

A 


V 

=U h M h (V h )*, 


<1 

A 


K,m-1 \ 

Km -1 


f{V t 


h\* 




h\* 


K-J Wm-1, 0 K-,.1 K-l,m-J \ 




( 8 ) 


where 


u? = K o u t ••• Ci) ecWm)XM - K = A v £i ■■■ V A) ecWm)xmr . (9) 


and M h E m 2 r)x(m 2 r ) is a weighted permutation matrix. Each submatrix iV/T is itself an m x m 
block matrix with block size r x r where all blocks are zero except that the (j, i) block is equal 
to the diagonal matrix S^. It is obvious that there are only 0(N ) nonzero entries in M h . See 
Figure [3] for an example of a middle level factorization of a 64 x 64 matrix with r = 1. 



Figure 3: The middle level factorization of a 64 x 64 complementary low-rank matrix K ~ 
U 3 M 3 (V 3 )* assuming r = 1. Grey blocks indicate nonzero blocks. U 3 and V 3 are block-diagonal 
matrices with 8 blocks. The diagonal blocks of U 3 and V 3 are assembled according to Equation © 
as indicated by black rectangles. M 3 is a 8 x 8 block matrix with each block Mfj itself an 8 x 8 
block matrix containing diagonal weights matrix on the (j, i) block. 


3.2 Recursive factorization 

In this section, we will recursively factorize 

U £ « U e+1 G l (10) 

for £ = h, h + 1,..., L — 1 and 

(v £ y « (H e y(v i+1 y (ii) 

for t = h, h + 1,..., L — 1. After these recursive factorizations, we can obtain the following butterfly 
factorization by substituting these factorizations into ([8]): 

K ^U L G L ~ l ■■■G h M h (H h )* (12) 
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3.2.1 Recursive factorization of U 


h 


Each factorization at level £ in (10) results from the low-rank property of Kf ■ for i > L/2. When 


£ = h, recall that 


U h = 


K 


\ 


uf 


ut-J 


and 


U?={U* o U* 1 ••• U^) 

with each G c( N / m ) xr , We split and each into halves by row, i.e., 

7 h,t\ 


U? = 


u 


u' 


h,b 


and U id = 


U h,t 

ho 

U h,b 


where the superscript t denotes the top half and b denotes the bottom half of a matrix. Then we 
have 


u b = 


Tjh,t Tjhjt 

U i,2j U i,2j+1 


/ jjh,t 

/ u i, 0 

U i, 1 ' ' ' 

TJ h,t \ 

■ u i,m-l \ 

1 U h,b 

\ u i, 0 


U h ’ b 1 

1 and 

T—1 

o 

II 

.. , m /2 — 1 


and ( U gj £/* 


h,b 

2j+l 


(13) 


(14) 


in (13) are in the column space of an d ^-2i+i j- respectively. By the complementary low- 

rank property of the matrix K , and j are numerical low-rank. Hence 

and (u^jU^j+iJ are numerically low-rank matrices in cW 2m ) x2r . Compute their rank-r ap¬ 
proximations by the standard truncated SVD, transform it into the form of ([5]) and denote them 


as 


u >A+i) * u S G k, (u,% u‘ 


h,b 

2j+l 


Tjh+l s-ih 
U 2i+l,j U 2i+l,j 


(15) 


for i = 0,1,..., m — 1 and j = 0,1,..., m/2 — 1. The matrices in (15) can be assembled into two 
new sparse matrices, such that 


(Ui 


h+1 


U h « U h+1 G h = 


U- 


h -\-1 


u h+1 / 

u 2m-l/ 


\ (G h , 

V 


G\ 


\ 


G h m -J 


where 


Tjh+l _ (Tjh +1 Tjh +1 . . . 77 / 1+1 \ 

U i — W i,0 U i, 1 U i,m/ 2-1 ) 
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for * = 0,1,, 2 m — 1, and 



for i = 0,1,..., m — 1. 

Since there are 0(1) nonzero entries in each 0(0 and there are O(N) such submatrices, there are 
only O(N) nonzero entries in G h . See Figure[4]top for an example of the factorization U h ~ JJ h+1 G h 
for the left factor U h with L = 6, h = 3 and r = 1 in Figure^ 



Figure 4: The recursive factorization of U 3 in Figure [ 3 J Gray factors are matrices inheriting the 
complementary low-rank property. Top: left matrix: U 3 with each diagonal block partitioned into 
smaller blocks according to Equation (13) as indicated by black rectangles; middle-left matrix: 
low-rank approximations of submatrices in U 3 given by Equation (15); middle right matrix: U 4 ; 
right matrix: G 3 . Bottom: U 4 in the first row is further factorized into U 4 ~ U 5 G 4 , giving 
U 3 « U 5 G 4 G 3 . 


Similarly, for any i between h and L — 1, we can factorize U e ~ U {+1 G^, because the columns 
in (Ui 2 jUi 2 j + \ \ and are in the column space of the numerically low-rank matrices 

it 2 tj an d ^ 2 i+\ ]■ respectively. Computing the rank-r approximations via the standard truncated 
SVD and transforming them into the form of ([5]) give 


U\ 


if, 


*, 2 j 


TT e,t 

U i,2j+1 


U^Gij and 


U. 


e,b 


U, 


i,b 


i,2j w i,2j+l 


a 


■£+1 


rJ 




(16) 


for i = 0,1,..., 2^ — 1 and j = 0,1,..., 2 L e 1 — 1. After assembling these factorizations together, 
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we obtain 


U l « U e+1 G l = 

(K +l 

U[ +1 

\ 


(Gi 

G[ 

\ 


V 

U e+1 i 
u 2 £ + 1 -l / 


V 



where 


u‘ +1 = (t/'; 


for * = 0,1,, 2 £+1 — 1, and 


/ <4o 


G = 


£+1 (jt+1 

0 u i, 1 


(it 

Lr 2i,l 


I7f 


£+1 




(it 


(it 

'~ T 2i+l,0 


(it 

LT 2i+l,l 


V 

for i = 0, l,...,2 e -l. 

After L — h steps of recursive factorizations 

U l 


G £ 


2i+l,2 L - t ~ 1 -l) 


U £+1 G l 


for £ = h, h + 1,..., L — 1, we obtain the recursive factorization of U h as 


U h 


U l G l ~ 1 • • • G h 


(17) 

See Figure [i] bottom for an example of a recursive factorization for the left factor U h with L = 6, 
h = 3 and r = 1 in Figure [3| 

Similar to the analysis of G h , it is also easy to check that there are only 0(N ) nonzero entries in 
each G £ in (17). Since there are 0(N ) diagonal blocks in U L and each block contains 0(1) entries, 


there is O(N) nonzero entries in U L . 

3.2.2 Recursive factorization of V h 

The recursive factorization of V h is similar to the one of U h . In each step of the factorization 

(v*)* «( H e y(v e+1 y , 

we take advantage of the low-rank property of the row space of - 1 and K^j+i to obtain 
rank-r approximations. Applying the exact same procedure of Section 3.2.1 now to V £ leads to the 
recursive factorization V h ~ V L H L ~ 1 ■ ■ ■ H h . or equivalently 


C v h y « {H n y • • • (v L y, 


rh\* 


L —1\* /t 


(18) 


with all factors containing only O(N) nonzero entries. See Figure [5] for an example of a recursive 
factorization ( V h )* « (H h )* ■ ■ ■ (H L ~ 2 )* (y L ~ 1 )* for the left factor V h with L = 6, h = 3 and r = 1 
in Figure [3] 
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Figure 5: The recursive factorization ( V 3 )* ~ (H 3 )* (H 4 )* (V 5 )* of (F 3 )* in Figure [ 3 J 


Given the recursive factorization of U h and ( V h )* in ( |17[ ) and (18), we reach the butterfly 
factorization 

■ G h M h (H h )* ■ ■ • ( H L - l )*{V L )\ (19) 


K « U l G l ~ 1 


where all factors are sparse matrices with 0(N) nonzero entries. For a given input vector g e C^, 
the 0{N 2 ) matrix-vector multiplication u = Kg can be approximated by a sequence of O(loglV) 
sparse matrix-vector multiplications given by the butterfly factorization. 


3.3 Complexity analysis 

The complexity analysis of the construction of a butterfly factorization naturally consists of two 
parts: the middle level factorization and the recursive factorization. 

The complexity of the middle level factorization depends on which one of the cases is under 
consideration. 


• For the first case, the approximate SVDs are determined by the application of K and K* to 
Gaussian random matrices in (^ NxNl/2 ( r +p) and the rank-r approximations of K^- for each 
(i, j) pair. Assume that each matrix-vector multiplication by K or K* via the given black-box 
routines requires 0(Ck(N )) operations (which is at least O(N)). Then the dominant cost is 
due to applying K and K* 0(N 1//2 ) times, which yields an overall computational complexity 
of 0{Ck{K)N 4 / 2 ). 


• In the second case, the approximate SVDs are computed via random sampling for each K^- 
of the O(N) pairs (i,j). The complexity of performing randomized sampling for each such 
block is 0(N 4 ^ 2 ). Hence, the overall computational complexity is Q{N 3 / 2 ). 


diagonal blocks of size 0(N/2 £ ) x 


In the recursive factorization, U 1 ' at level i consists of 0(2 e ) 

0{N/2 ( ). In each diagonal block, there are 0(N/2 e ) factorizations in (16). 
complexity of performing one factorization in ( |16| ) is 0(N/2 i ) 
factorize U Summing up the operations at all levels gives the total complexity for recursively 
factorizing U h : 


Since the operation 
it takes 0(N 2 /2 l ) operations to 


L—l 

£ 

l=h 


0{N 2 /2 l ) = 0{N 3 ' 2 ). 


( 20 ) 


Similarly, the operation complexity for recursively compressing V h is also 0(N 3 / 2 ). 

The memory peak of the butterfly factorization occurs in the middle level factorization since we 
have to store the initial factorization in ([8]). There are 0(N 3 / 2 ) nonzero entries in U h and V h 7 and 
0(N) in M h . Hence, the total memory complexity is 0(N 3//2 ). The total operation complexity for 
constructing the butterfly factorization is summarized in Table [T] 
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Randomized 

Randomized 



SVD 

sampling 


Middle level 
factorization 

0(C k {N)N 1/2 ) 

0(N 3 / 2 ) 

Factorization 

Recursive 

0(N 3 / 2 ) 

Complexity 

factorization 


Total 

0{C k {N)N 1 / 2 ) 

0{N 3 / 2 ) 

Memory 

Complexity 


0(N 3 / 2 ) 

(){N log N) 

Application 

Complexity 


0(N log N) 


Table 1: Computational complexity and memory complexity of the butterfly factorization. Ck(N) 
is the operation complexity of one application of K or K*. In most of the cases encountered, 
C k (N) = 0(N log N). 


It is worth pointing out that the memory complexity can be reduced to 0(N log N), when we 
apply the randomized sampling method to construct each block in the initial factorization in Q 
separately. Instead of factorizing U h and V h at the end of the middle level factorization, we can 
factorize the left and right factors U d and V- 1 in ([8]) on the fly to avoid storing all factors in ([8]) . For 
a fixed i, we generate U- 1 from for all j. and recursively factorize Uj 1 . The memory cost is 0(N ) 
for storing U- 1 and 0(N 1 ^ 2 log N) for storing the sparse matrices after its recursive factorization. 
Repeating this process for i = 1,, N 1 / 2 gives the complete factorization of U h . The factorization 
of V h is conducted similarly. The total memory complexity is 0(N log N). 

The operation and memory complexity for the application of the butterfly factorization are 
governed by the number of nonzero entries in the factorization: 0(N log N). 

4 Numerical results 

This section presents three numerical examples to demonstrate the effectiveness of the algorithms 
proposed above. The first example is an FIO in [1J and the second example is a special function 
transform in |14| . Both examples provide an explicit kernel function that becomes a one-dimensional 
complementary low-rank matrix after discretization. This allows us to apply the butterfly factor¬ 
ization construction algorithm with random sampling. The computational complexity and the 
memory cost are 0(N 3 / 2 ) and 0(N log N) in this case. 

The third example is a composition of two FIOs for which an explicit kernel function of their 
composition is not available. Since we can apply either the butterfly algorithm in [T] or the butterfly 
factorization to evaluate these FIOs one by one, a fast algorithm for computing the composition is 
available. We apply the butterfly factorization construction algorithm with random matrix-vector 
multiplication to this example which requires 0{N 3 / 2 log N) operations and 0(N 3//2 ) memory cost. 

Our implementation is in MATLAB. The numerical results were obtained on a server com¬ 
puter with a 2.0 GHz CPU. The additive oversampling parameter is p = 5 and the multiplicative 
oversampling parameter is q = 3. 

Let {u d {x) : x E X } and {u a (x),x E X} denote the results given by the direct matrix-vector 
multiplication and the butterfly factorization. The accuracy of applying the butterfly factorization 
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algorithm is estimated by the following relative error 


e a = 


E* 6 s !«“(*) -« d (*)l : 


E. 6 5 l« d (*)I 2 

where S' is a point set of size 256 randomly sampled from X. 


( 21 ) 


Example 1. Our first example is to evaluate a one-dimensional FIO of the following form: 

u(x) = [ (22) 

Jr 

where / is the Fourier transform of /, and 4>(x, £) is a phase function given by 

<J>(a:,£) = x ■ £ + c(x)|£|, c(x) = (2 + sin(27rx))/8. (23) 


The discretization of (22) is 


u(xi) = ]T i,j = l,2,...,N, 

£>3 


(24) 


where {xi} and {^} are uniformly distributed points in [0,1) and [—IV/2,IV/2) following 

Xi = (i — 1)/IV and £j = j — 1 — N/2. (25) 

(24) can be represented in a matrix form as u = Kg, where Ui = u(xi), Kij = and 


gj = The matrix K satisfies the complementary low-rank property as proved in [U[8]. The 

explicit kernel function of K allows us to use the construction algorithm with random sampling. 
Table [2] summarizes the results of this example for different grid sizes N and truncation ranks r. 

Example 2. Next, we provide an example of a special function transform. This example can be 
further applied to accelerate the Fourier-Bessel transform that is important in many real applica¬ 
tions. Following the standard notation, we denote the Hankel function of the first kind of order 
m by Hin ■ When m is an integer, H$ has a singularity at the origin and a branch cut along 
the negative real axis. We are interested in evaluating the sum of Hankel functions over different 
orders, 


N 


U( 


( x i) = y2 H j-l( x i)3j’ *=1,2,. 


Z_/ ■ 3 

3 = 1 


,N, 


(26) 


which is analogous to expansion in orthogonal polynomials. The points Xi are defined via the 
formula, 

2tt 

Xj. — X + •(;- 1) (27) 


which are bounded away from zero. It is demonstrated in m that (|26|) can be represented via 


u — Kg where K satisfies the complementary low-rank property, u t = u{xi) and Kjj = (x*). 

The entries of matrix K can be calculated efficiently and the construction algorithm with random 
sampling is applied to accelerate the evaluation of the sum (26). Table [3] summarizes the results of 
this example for different grid sizes N and truncation ranks r. 

From Table [2] and [3j we note that the accuracy of the butterfly factorization is well controlled 
by the max rank r. For a fixed rank r, the accuracy is almost independent of A*. In practical 
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N,r 

e a 

Tf actor (jYl%Tl) 

Td(sec) 

T a {sec ) 

Td/T a 

1024,4 

2.49e-05 

2.92e-01 

2.30e-01 

3.01e-02 

7.65e+00 

4096,4 

4.69e-05 

1.62e+00 

2.64e+00 

4.16e-02 

6.35e+01 

16384,4 

5.77e-05 

1 .22e+01 

2.28e+01 

1.84e-01 

1.24e+02 

65536,4 

6.46e-05 

8 .10e+01 

2.16e+02 

1 .02e+00 

2 .12e+02 

262144,4 

7.13e-05 

4.24e+02 

3.34e+03 

4.75e+00 

7.04e+02 

1024,6 

1.57e-08 

1.81e-01 

1.84e-01 

1 .20e-02 

1.54e+01 

4096,6 

3.64e-08 

1.55e+00 

2.56e+00 

6.42e-02 

3.98e+01 

16384,6 

6.40e-08 

1.25e+01 

2.43e+01 

3.01e-01 

8.08e+01 

65536,6 

6.53e-08 

9.04e+01 

2.04e+02 

1.77e+00 

1.15e+02 

262144,6 

6.85e-08 

5.45e+02 

3.68e+03 

8.62e+00 

4.27e+02 

1024,8 

5.48e-12 

1.83e-01 

1.78e-01 

1.63e-02 

1.09e+01 

4096,8 

1.05e-ll 

1.98e+00 

2.71e+00 

8.72e-02 

3.11e+01 

16384,8 

2.09e-ll 

1.41e+01 

3.34e+01 

5.28e-01 

6.33e+01 

65536,8 

2.62e-ll 

1.17e+02 

2 .10e+02 

2.71e+00 

7.75e+01 

262144,8 

4.13e-ll 

6.50e+02 

3.67e+03 

1.52e+01 

2.42e+02 


Table 2: Numerical results for the FIO given in (24). N is the size of the matrix; r is the fixed rank 
in the low-rank approximations; Tpactor is the factorization time of the butterfly factorization; Tj is 
the running time of the direct evaluation; T a is the application time of the butterfly factorization; 
Td/T a is the speedup factor. 


N, r 

e a 

Tf actor (min) 

Td(sec) 

Ta(sec) 

T d /T a 

1024,4 

2.35e-06 

8.78e-01 

8.30e-01 

1.06e-02 

7.86e+01 

4096,4 

5.66e-06 

5.02e+00 

5.30e+00 

2.83e-02 

1.87e+02 

16384,4 

6.86e-06 

3.04e+01 

5.51e+01 

1.16e-01 

4.76e+02 

65536,4 

7.04e-06 

2 .01e+02 

7.59e+02 

6.38e-01 

1.19e+03 

1024,6 

2.02e-08 

4.31e-01 

7.99e-01 

9.69e-03 

8.25e+01 

4096,6 

4.47e-08 

6.61e+00 

5.41e+00 

4.52e-02 

1 .20e+02 

16384,6 

5.95e-08 

4.19e+01 

5.62e+01 

1.61e-01 

3.48e+02 

65536,6 

7.86e-08 

2.76e+02 

7.60e+02 

1 .01e+00 

7.49e+02 


Table 3: Numerical results with the matrix given by (26). 
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applications, one can set the desired e ahead and increase the truncation rank r until the relative 
error reaches e. 

The tables for Example 1 and Example 2 also provide numerical evidence for the asymptotic 
complexity of the proposed algorithms. The construction algorithm based on random sampling 
is of computational complexity 0(1V 3 / 2 ). When we quadruple the problem size, the running time 
of the construction sextuples and is better than we expect. The reason is that in the random 
sampling method, the computation of a middle matrix requires pseudo-inverses of r x r matrices 
whose complexity is 0(r 3 ) with a large prefactor. Hence, when N is not large, the running time 
will be dominated by the 0(r 3 N ) computation of middle matrices. The numbers also show that the 
application complexity of the butterfly factorization is 0(iV logiV) with a prefactor much smaller 
than the butterfly algorithm with Chebyshev interpolation [lj. In example 1, when the relative error 
is e ~ 10 5 , the butterfly factorization truncates the low-rank submatrices with rank 4 whereas 
the butterfly algorithm with Chebyshev interpolation uses 9 Chebyshev grid points. The speedup 
factors are 200 on average. 


Example 3. In this example, we consider a composition of two FIOs, which is the discretization 
of the following operator 

u{x) = f e 2 *t* a (*,*?) f g-2-w f e 27n$i(s/,0 j(£) dydrj. (28) 

Jr J r Jr 


For simplicity, we consider the same phase function dq = dq = as given by (23). By the 
discussion of Example 1 for one FIO, we know the discrete analog of the composition (28) can be 
represented as 

u = KFKFf =: KFKg , with g = Ff, 


where F is the standard Fourier transform in matrix form, K is the same matrix as in Example 
1, Ui = u(xi), and gj = Under mild assumptions as discussed in [7], the composition of 

two FIOs is an FIO. Hence, the new kernel matrix K = KFK again satisfies the complementary 
low-rank property, though typically with slightly increased ranks. 

Notice that it is not reasonable to compute the matrix K directly. However, we have the 
fast Fourier transform (FFT) to apply F and the butterfly factorization that we have built for 
K in Example 1 to apply K. Therefore, the construction algorithm with random matrix-vector 
multiplication is applied to factorize K. 

Since the direct evaluation of each m takes 0(N 2 ) operations, the exact solution {uf}i £ s for 
a selected set S is infeasible for large N. We apply the butterfly factorization of K and the FFT 
to evaluate {rejigs as an approximation to the exact solution {uf}i^s- These approximations are 
compared to the results {uf}i^s that are given by applying the butterfly factorization of K. Tableji] 
summarizes the results of this example for different grid sizes N and truncation ranks r. 

Table [ 4 ] shows the numerical results of the butterfly factorization of K. The accuracy improves 
as we increase the truncation rank r. Comparing Table [4] with Table [2j we notice that, for a fixed 
accuracy, the rank used in the butterfly factorization of the composition of FIOs should be larger 
than the rank used in a single FIO butterfly factorization. This is expected since the composition is 
in general more complicated than the individual FIOs. Tp ac t or grows on average by a factor of ten 
when we quadruple the problem size. This agrees with the estimated 0(N 3//2 log N) computational 
complexity for constructing the butterfly factorization. The column T a shows that the empirical 
application time of our factorization is close to the estimated complexity 0(iV log N). 
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N,r 

e a 

Tf actor {min) 

T d {sec) 

Ta(sec) 

T d /T a 

1024,4 

1.40e-02 

3.26e-01 

3.64e-01 

4.74e-03 

7.69e+01 

4096,4 

1.96e-02 

4.20e+00 

6.59e+00 

2.52e-02 

2.62e+02 

16384,4 

2.34e-02 

4.65e+01 

3.75e+01 

1.15e-01 

3.25e+02 

65536,4 

2.18e-02 

4.33e+02 

3.73e+02 

6.79e-01 

5.49e+02 

1024,8 

6.62e-05 

3.65e-01 

3.64e-01 

8.25e-03 

4.42e+01 

4096,8 

8.67e-05 

4.94e+00 

6.59e+00 

5.99e-02 

1 .10e+02 

16384,8 

1.43e-04 

6.23e+01 

3.75e+01 

3.47e-01 

1.08e+02 

65536,8 

1.51e-04 

6.91e+02 

3.73e+02 

1.76e+00 

2 .12e+02 

1024,12 

1.64e-08 

4.79e-01 

3.64e-01 

1.48e-02 

2.46e+01 

4096,12 

1.05e-07 

6.35e+00 

6.59e+00 

1 .12e-01 

5.88e+01 

16384,12 

2.55e-07 

7.58e+01 

3.75e+01 

7.64e-01 

4.91e+01 

65536,12 

2.69e-07 

7.63e+02 

3.73e+02 

4.39e+00 

8.49e+01 


Table 4: Numerical results for the composition of two FIOs. 


5 Conclusion and discussion 

This paper introduces a butterfly factorization as a data-sparse approximation of complementary 
low-rank matrices. More precisely, it represents such an N x N dense matrix as a product of 
0{logN) sparse matrices. The factorization can be built efficiently if either a fast algorithm for 
applying the matrix and its adjoint is available or an explicit expression for the entries of the matrix 
is given. The butterfly factorization gives rise to highly efficient matrix-vector multiplications 
with 0(N log N) operation and memory complexity. The butterfly factorization is also useful 
when an existing butterfly algorithm is repeatedly applied, because the application of the butterfly 
factorization is significantly faster than pre-existing butterfly algorithms. 

The method proposed here is a first step in computing data-sparse approximations for butterfly 
algorithms. As we mentioned earlier, the proposed butterfly factorization can be easily generalized 
to higher dimensions, which is especially relevant in imaging science. Another interesting direction 
is to invert a matrix via the butterfly factorization. While the numerical results of this paper 
include a couple of examples, it is natural to consider other important class of transforms, such 
as the non-uniform Fourier transform and the Legendre functions associated with the spherical 
harmonic transform. 
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